
Exam Taker - a simple analysis of preliminary test results and some personal reflections¶

Don't worry about the code you'll see here; it is part of the effort to reach the results. Simply read the text and look at the tables and graphs. Everything is explained alongside the results the code produces.


On October 29, 2023, a rainy Sunday, a test was held for the position of administrative technician according to Notice No. 151/2023-PRH. The test took place at the nostalgic State University of Maringá. *I say nostalgic because it was there that I obtained my Bachelor's degree in Music a few years ago, and it was a good time, ah!

I honestly thought I would do well on this test, and that was one of the factors that made me extremely nervous, considering that I had come close but missed the mark in previous exams. However, this time, the subjects were ones I was familiar with. *I mentioned the rainy day, not that it's relevant to the work you're about to see, but because I want to emphasize that this aspect worsened my nervousness due to the traffic we had to face (on that day, my wife drove), even though we had ample time to arrive... well, I don't need to say much more. If you're an exam taker or have been through exams that could decide something crucial for you, even in school, then you'll understand.

I took the test, and Portuguese was a bit challenging, you know? *I usually do well in Portuguese, but I knew this time it wouldn't be my strong suit. However, I was sure I aced the rest, except for mathematics, which is my weakest point (as we'll see in the analyses I did here).

The preliminary objective-test scores were released on November 22, 2023, in Notice No. 351/2023-PRH - Objective Test Result, after the appeal period, during which candidates could, metaphorically speaking, challenge the exam board in hopes of changing the answer key for a question. *With this result in hand, I thought:

"Well, since I'm learning data analysis, I'll practice some tools, ask for help from ChatGPT, and do a simple analysis of the preliminary position in which each candidate might be based on the data obtained in the notice with preliminary results. *Who knows, maybe I'll compare a few things here and there?"

And thus, this work is born.

I won't say I understand every line of code present here exactly at the project delivery moment because my focus was on the analysis, on telling a story I already had in mind. But I can say I have enough tools to study and understand what's here, as everything presented comes from a scope where I've studied the basics and know how to search. So, this project serves as a study tool. I can reverse engineer points I don't master and learn from what I've done.

What helped me the most in achieving this were the courses (which I took through the "financial aid application") on the COURSERA - IBM Applied AI Professional Certificate site (among the courses in this track, I've already obtained 4 learning certificates), as well as free courses on prompt engineering from Udemy (for dealing with ChatGPT) and on YouTube. These courses teach how to extract the best from the natural language processing tool that is "GPT."

I can't claim I'm correct in everything I did here; this is an attempt and an accomplishment to demonstrate what I could analyze at a first glance. It's my way of telling a story about the preliminary performance of the candidates, and I apologize for any inconsistencies. I also rely on the help of those who know more than I do to improve my techniques.

Regarding the data, of course, during the journey in this project, I thought:

"Wow! If I had access to each alternative that each candidate answered for each question, I could identify more specific strengths and weaknesses."

Imagine if we could also know what each candidate has for breakfast, who prepares their meals, if they work, if they only study; how much could we extract? But, well, I leave this task to those who sought or "made" these data (if they exist) and will surely handle them better than I, a mere aspiring data analyst.

Given these considerations, I really would like to say that I wrote this introduction last in the project. I knew exactly what I was going to do before starting all of this, but I didn't know exactly how it would turn out. However, the ending was what I expected.

Thank you for embarking on this journey with me!

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
%pip install nbformat --upgrade
Requirement already satisfied: nbformat in c:\users\eflov\anaconda3\lib\site-packages (5.9.2)
Note: you may need to restart the kernel to use updated packages.

Requirement already satisfied: fastjsonschema in c:\users\eflov\anaconda3\lib\site-packages (from nbformat) (2.16.2)
Requirement already satisfied: jsonschema>=2.6 in c:\users\eflov\anaconda3\lib\site-packages (from nbformat) (4.17.3)
Requirement already satisfied: jupyter-core in c:\users\eflov\anaconda3\lib\site-packages (from nbformat) (5.3.0)
Requirement already satisfied: traitlets>=5.1 in c:\users\eflov\anaconda3\lib\site-packages (from nbformat) (5.7.1)
Requirement already satisfied: attrs>=17.4.0 in c:\users\eflov\anaconda3\lib\site-packages (from jsonschema>=2.6->nbformat) (22.1.0)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in c:\users\eflov\anaconda3\lib\site-packages (from jsonschema>=2.6->nbformat) (0.18.0)
Requirement already satisfied: platformdirs>=2.5 in c:\users\eflov\anaconda3\lib\site-packages (from jupyter-core->nbformat) (2.5.2)
Requirement already satisfied: pywin32>=300 in c:\users\eflov\anaconda3\lib\site-packages (from jupyter-core->nbformat) (305.1)

Test Program¶

ANNEX I TO NOTICE Nº151/2023-PRH

Let's look at the topics covered in the test:

PORTUGUESE LANGUAGE

  1. Comprehension and interpretation of texts.
  2. Meaning of words (synonyms, antonyms, homonyms, paronyms).
  3. Essential terms of the sentence (subject, predicate).
  4. Integrating terms of the sentence (verbal complements [direct and indirect object]; nominal complements).
  5. Accessory terms of the sentence (adnominal adjunct, adverbial adjunct, appositive, vocative).
  6. Verbal and nominal agreement.
  7. Verbal and nominal regency.
  8. Crasis.
  9. Punctuation.
  10. Pronominal placement.
  11. Accentuation.
  12. Spelling.
  13. Use of "whys."

MATHEMATICS

  1. Ratios and proportions.
  2. Proportional division.
  3. Simple rule of three.
  4. Percentage.
  5. Applications of logical inference rules.

BASIC COMPUTER SKILLS

  1. BASIC CONCEPTS: Hardware and Software; operating systems; application programs; computer networks; Internet; email; file transfer; mobile devices, concepts of information security.
  2. BASIC CONFIGURATION NOTIONS: components, peripherals, processors, external devices. Concepts of organization and management of information, files, folders, and programs. Storage, Input, and Output Devices.
  3. EDITORS: Knowledge of Word Processors, Spreadsheets, Presentation Editors.

BASIC KNOWLEDGE OF LEGISLATION AND STATUTE OF CHILDREN AND ADOLESCENTS

  1. Federal Constitution: Title II - Chapter I - Individual and Collective Rights and Duties (art. 5 and its paragraphs); Title III - Chapter VII - Public Administration (art. 37 to 41).
  2. Constitution of the State of Paraná - Title II - Public Administration - Chapters I and II (arts. 27 to 43).
  3. Statute of Public Servants of the State of Paraná (State Law No. 6,174/1970).
  4. Statute of Children and Adolescents (Law No. 8,069/1990)

Basic Data Reading¶

Here, I am just performing some basic data readings without any analyses or insights.

In [2]:
file = r'D:\Jeanco\Meus projetos\notas_prova_uem\resultado_final_csv.csv'
In [3]:
# Data presented with all candidate names, subjects, etc.
df = pd.read_csv(file, skiprows=1)
df
Out[3]:
INSCRIÇÃO CANDIDATO L.PORT MAT INF O LEG EST NOTA
0 4859 Abgail de Souza 1,50 0,25 1,00 0,25 0,50 3,50
1 4751 Ábia Morais Silva 1,25 0,00 0,50 0,75 0,50 3,00
2 12493 Abinair Trindade Miranda Valerio 1,00 0,25 0,25 0,50 0,50 2,50
3 17346 Abner Fellipe Beliato 1,25 0,00 0,75 0,25 0,50 2,75
4 13377 Abner Mariano 2,75 1,00 1,00 2,00 0,50 7,25
... ... ... ... ... ... ... ... ...
9699 10446 Zildeni Nunes Acacio Cassidori 1,25 0,00 0,00 0,75 0,25 2,25
9700 19599 Zilma Fernandes dos Santos Soares 1,75 0,25 0,00 1,25 0,50 3,75
9701 12953 Zingara Facco Rodrigues 1,00 0,25 0,25 0,75 0,50 2,75
9702 8830 Zuleica Silva Marques de Lima 2,75 0,50 0,75 0,50 0,50 5,00
9703 4453 Zuleika Shiraishi Kagueyama 1,25 0,50 0,75 1,00 0,50 4,00

9704 rows × 8 columns

Below, I will perform operations to rename the columns to facilitate data manipulation.

In [4]:
# Rename Columns
df.columns = ['Inscricao', 'Candidato', 'Port', 'MAT', 'INF', 'LEG', 'ECA', 'Nota Final']

# Standardize names to lowercase and separate by _
df.columns = df.columns.str.lower().str.replace(' ', '_')
In [5]:
# Viewing the data with renamed columns
df
Out[5]:
inscricao candidato port mat inf leg eca nota_final
0 4859 Abgail de Souza 1,50 0,25 1,00 0,25 0,50 3,50
1 4751 Ábia Morais Silva 1,25 0,00 0,50 0,75 0,50 3,00
2 12493 Abinair Trindade Miranda Valerio 1,00 0,25 0,25 0,50 0,50 2,50
3 17346 Abner Fellipe Beliato 1,25 0,00 0,75 0,25 0,50 2,75
4 13377 Abner Mariano 2,75 1,00 1,00 2,00 0,50 7,25
... ... ... ... ... ... ... ... ...
9699 10446 Zildeni Nunes Acacio Cassidori 1,25 0,00 0,00 0,75 0,25 2,25
9700 19599 Zilma Fernandes dos Santos Soares 1,75 0,25 0,00 1,25 0,50 3,75
9701 12953 Zingara Facco Rodrigues 1,00 0,25 0,25 0,75 0,50 2,75
9702 8830 Zuleica Silva Marques de Lima 2,75 0,50 0,75 0,50 0,50 5,00
9703 4453 Zuleika Shiraishi Kagueyama 1,25 0,50 0,75 1,00 0,50 4,00

9704 rows × 8 columns

Quickly and more clearly, let's also take a look at how many rows and columns are in this table.

In [6]:
# Checking the number of rows and columns
df.shape
Out[6]:
(9704, 8)

The first value in parentheses indicates the number of rows, and the second value indicates the number of columns.

Now I'll check the data types, which is a common practice in data analysis. Basically, I need to know if I'm dealing with numbers or text, but this is a very simplified explanation, so don't worry about this part.

In [7]:
# Observing Data Types
df.dtypes
Out[7]:
inscricao      int64
candidato     object
port          object
mat           object
inf           object
leg           object
eca           object
nota_final    object
dtype: object

The results above are not quite the data types I expected: the grade columns were read as text (object) because the file uses comma decimals. Don't worry about this part; I will now convert the data to the formats I expect them to have.

In [8]:
# Converting Data Types
cols_to_convert = ['port', 'mat', 'inf', 'leg', 'eca', 'nota_final']

# Function to convert values to float or keep unchanged if not possible
def convert_to_float(value):
    try:
        return float(value.replace(',', '.'))  # Replace comma with dot to avoid issues
    except (ValueError, AttributeError):  # Handle AttributeError when encountering non-string values
        return value

# Apply the function only to specific columns
for column in cols_to_convert:
    df[column] = df[column].apply(convert_to_float)
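
As a side note, a minimal sketch with hypothetical sample data (not the notebook's actual file): `pd.read_csv` can parse comma decimals directly at read time via its `decimal` parameter, as long as the field separator differs from the decimal comma (here I assume ';').

```python
import io
import pandas as pd

# Hypothetical sample in the same style as the notice's data:
# ';'-separated fields with comma decimal marks.
sample = "port;mat\n1,50;0,25\n2,75;1,00\n"

# read_csv performs the comma-to-dot conversion while parsing.
df_alt = pd.read_csv(io.StringIO(sample), sep=';', decimal=',')
print(df_alt.dtypes)  # port and mat come back as float64
```

When the separator allows it, this avoids a per-value conversion pass like the one above.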

I'll explain the next step in a way that a non-expert can understand. A quick check is performed to see whether there are any null values in each column. As we can see, every column shows "0" (zero) nulls, meaning there is no "missing data." To make it even clearer, take the "candidato" column as an example: the '0' there indicates that every row in that column really does have a value.

In [9]:
# Checking for any null values in the table
df.isnull().sum()
Out[9]:
inscricao     0
candidato     0
port          0
mat           0
inf           0
leg           0
eca           0
nota_final    0
dtype: int64

Basic Statistics¶

Let's "write" a table that presents basic statistical data. Consider that each column represents a specific subject of the test, and the final grade (sum of all grades in each subject) is in the last column. Basically, the first column where we see count, mean, std, etc. provides some information, of which I want to highlight count, mean, min, max.

  • count - is the number of rows in the table. We can say that each column has this number of rows, so we can conclude that this is the number of candidates who took the test.
  • mean - is the average achieved in all columns. The "port" column, for example, shows the average 1.922352, which is the average result of all grades for the Portuguese subject.
  • min - presents the minimum value for each column. The "mat" column, for example, shows the value "0.000000," meaning the minimum grade in mathematics was zero.
  • max - presents the maximum value for each column. The "mat" column, for example, shows 1.25, meaning the maximum grade in mathematics was 1.25.
In [10]:
# Table to Present Basic Statistics such as Mean, Median, Maximum, Minimum, etc.
df[['port','mat','inf','leg','eca','nota_final']].describe()
Out[10]:
port mat inf leg eca nota_final
count 9704.000000 9704.000000 9704.000000 9704.000000 9704.000000 9704.000000
mean 1.922352 0.420213 0.770945 0.954941 0.428329 4.496780
std 0.748732 0.335230 0.312252 0.444426 0.130710 1.383657
min 0.250000 0.000000 0.000000 0.000000 0.000000 0.250000
25% 1.250000 0.250000 0.500000 0.750000 0.250000 3.500000
50% 1.750000 0.250000 0.750000 1.000000 0.500000 4.500000
75% 2.500000 0.750000 1.000000 1.250000 0.500000 5.500000
max 4.500000 1.250000 1.250000 2.000000 0.500000 9.000000

Take some time to analyze the table above. Look at the "nota_final" column, for example. For this column, the "mean" (average) is about 4.5, which is a relatively low grade: we had 9704 candidates, and the total value of the test was 10.0. There is much more information that can be extracted from this simple statistical table, but it goes beyond the scope of this project.

In [11]:
# Average of Subjects
subjects = ['port', 'mat', 'inf', 'leg', 'eca']
means = df[subjects].mean()
print(means)
port    1.922352
mat     0.420213
inf     0.770945
leg     0.954941
eca     0.428329
dtype: float64

Before continuing, I will create a copy of the original data, just for safety, in case I need to perform any manipulation that could alter the original data. For cases like this, I will use the created copy:

In [12]:
# Create a General Copy of the Original DataFrame
df_copy = df.copy()

As I mentioned, I found the Portuguese test quite challenging, folks. When I came across that lengthy 170-line text, full of peculiar terms like "de chofre" and "palejam," I confess I felt a bit out of my comfort zone. The writing style, more akin to a college entrance exam than a job competition, made the task even more complex in my opinion.

But now, having observed the averages, an interesting realization: the subject in which I struggled the most was surprisingly the one that had the highest overall average. Portuguese Language leads, even with the complexity of the text.

However, let's analyze an important point: the number of questions. Considering how the grades were distributed according to the number of questions in each subject, the test topics break down as follows:

SUBJECT                                 QUESTIONS × VALUE = TOTAL
Portuguese Language                        20 × 0.25 = 5.00
Mathematics                                 5 × 0.25 = 1.25
Basic Computer Skills                       5 × 0.25 = 1.25
Basic Knowledge of Legislation              8 × 0.25 = 2.00
Statute of Children and Adolescents         2 × 0.25 = 0.50
Total                                      40 × 0.25 = 10.00
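
The scoring scheme above is easy to double-check in code. A minimal sketch (the subject labels are my own shorthand):

```python
# Sanity check of the scoring scheme: each question is worth 0.25 points.
questions = {
    'Portuguese': 20,
    'Math': 5,
    'CompSci': 5,
    'Legislation': 8,
    'ECA': 2,
}
points = {subject: n * 0.25 for subject, n in questions.items()}
print(points)                 # per-subject maximum possible grade
print(sum(points.values()))   # 10.0, the total value of the test
```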

Here we have something to consider, and looking at the quantity of questions in percentage terms reveals other insights:

Let's use approximate values to make visualization easier.

Portuguese Language has 20 questions, each worth 0.25 points, totaling 5 points. The average in this subject was about 1.92, which corresponds to roughly 7.7 correct questions, or about 38% of the total possible. On the other hand, the ECA subject has 2 questions, each worth 0.25 points, totaling 0.5 points. The average for this subject was about 0.43, which corresponds to roughly 1.7 correct questions, or about 86% of the total possible.

Now, when we directly compare these percentages, does it mean the average in ECA is higher than in Portuguese Language? Nooooo: what is higher is the percentage of correct answers, not the raw average. This percentage approach takes the total possible number of correct answers in each subject into account and offers a more intuitive view of the candidates' performance relative to the number of questions in each discipline.

So, going back to "the averages," the average of 1.9 in Portuguese Language is low, considering exclusively the total possible points (5.0), but it remains the highest average in the dataset.

Interestingly - or not -, I made six mistakes in this subject, including errors when transferring answers to the final answer sheet, and I also made two mistakes in mathematics. Even with this personal challenge, the analysis suggests that, in a way, candidates were better prepared for Portuguese Language than I initially perceived, but only in the sense that it holds the highest average among all subjects.


Now, let's talk a bit more about "guessing." The idea that the higher average in Portuguese Language is due to guessing does not seem to hold. Each question has the same, independent probability of being answered correctly by chance, regardless of how many questions the test has. So, under pure guessing, the expected percentage of correct answers is identical for a long test and a short test; what changes is the spread: with more questions, random results cluster more tightly around that expected percentage, which makes an unusually high score by luck alone less likely, not more. Therefore, it is not correct to say that guessing all questions would yield a higher average in a longer test than in a shorter one.
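
The argument can be sketched numerically. Assuming, hypothetically, 5 alternatives per question (the notice's alternative count isn't stated here), a pure guess has probability p = 1/5 per question, and the standard deviation of the fraction of correct answers shrinks as the test grows:

```python
import math

p = 1 / 5  # assumed chance of guessing one question correctly

for n in (2, 20):  # an ECA-sized test vs a Portuguese-sized test
    expected_fraction = p                      # the same for any n
    std_fraction = math.sqrt(p * (1 - p) / n)  # shrinks as n grows
    print(f"n={n:2d}: expected fraction correct = {expected_fraction:.2f}, "
          f"std = {std_fraction:.2f}")
```

In other words, guessing gives the same expected percentage on both tests, but the short test is far more likely to produce an extreme (lucky or unlucky) percentage.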

Well, in case it's still not clear, let's visualize it with a graph:

In [13]:
# Bar Chart Function
def create_bar_chart(ax, labels, values, max_values, colors, title, xlabel, ylabel, legend_label):
    # Iterate over subjects
    for i, label in enumerate(labels):
        max_value = max_values[i]
        value = values[i]

        # Add a bar up to the maximum possible grade
        ax.bar(label, max_value, color='lightgray', label=legend_label if i == 0 else "")

        # Add a filled bar up to the height corresponding to the average
        ax.bar(label, value, color=colors[i])

        # Add annotation on the maximum possible grade
        ax.annotate(f'{max_value:.2f}', xy=(label, max_value), xytext=(0, 5),
                    textcoords='offset points', ha='center', va='bottom', color='black', fontsize=8)

    # Add labels and title
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)

    # Add legend
    ax.legend()

# Data from describe
data = {
    'Subject': ['Portuguese', 'Math', 'CompSci', 'Legislation', 'ECA'],
    'Average': [1.922352, 0.420213, 0.770945, 0.954941, 0.428329],
}

# Data for the number of questions and maximum possible grade per subject
questions_max = {
    'Subject': ['Portuguese', 'Math', 'CompSci', 'Legislation', 'ECA'],
    'Number of Questions': [20, 5, 5, 8, 2],
    'Maximum Possible Grade': [5.00, 1.25, 1.25, 2.00, 0.50],
}

# Bar Chart for Averages
fig, ax = plt.subplots(figsize=(10, 5))
create_bar_chart(ax, data['Subject'], data['Average'], questions_max['Maximum Possible Grade'],
                 ['blue', 'orange', 'green', 'red', 'purple'], 'Average per Subject', 'Subject', 'Grade', 'Maximum possible grade')

plt.show()

Let's understand together, once and for all, what the graph above shows!

Each color represents the overall average achieved by all candidates, as we have seen before in the basic statistics table.

The gray bars? Well, those are like a reminder, showing the maximum possible grade in each subject. A visual way to understand how candidates performed in relation to the overall average, considering the maximum they could achieve. All clear so far?

Note that Portuguese has the highest average in the set. This coincides with the fact that it is also the subject with the highest maximum possible score: the most one could achieve in Portuguese is 5.0, the highest of any subject in the test. But it is worth noting that there is no direct causal relationship between these two facts.

In [14]:
# Data for the number of questions per subject
questions = {
    'Subject': ['Portuguese', 'Math', 'CompSci', 'Legislation', 'ECA'],
    'Number of Questions': [20, 5, 5, 8, 2],
}

# Calculate percentage of correct answers based on average and number of questions
data['Percentage Average'] = [(m / (q * 0.25)) * 100 for m, q in zip(data['Average'], questions['Number of Questions'])]

# Bar Chart for Percentage of Correct Answers
fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(data['Subject'], data['Percentage Average'], color=['blue', 'orange', 'green', 'red', 'purple'])

# Add labels and title
ax.set_title('Percentage of Correct Answers per Subject')
ax.set_xlabel('Subject')
ax.set_ylabel('Percentage of Correct Answers')

# Add values above the bars
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 2), ha='center', va='bottom')

# Add grid
ax.grid(axis='y', linestyle='--', alpha=0.7)

# Adjust y-axis limits
plt.ylim(0, 100)

plt.show()

For the percentages of correct answers, as we saw, ECA is way ahead of Portuguese (and of all the other subjects). Portuguese sits quite a bit "lower," right?

The graph really shows us what we had considered earlier, and it includes the exact percentage values.

Another point I want to consider is how subjects like Portuguese and Mathematics ended up among the lowest percentages of correct answers. Let's highlight a discrepancy between common perception and the actual results:

"Take that job entrance exam, it's high school stuff, all easy stuff!!!" Have you heard that before? I have, a lot!!!

Well, the percentages of correct answers revealed by the graph surprise us! Contrary to the common perception that subjects like Portuguese and Mathematics, considered 'basic' high school material, would be easier, the numbers show another reality. ECA, for example, leads comfortably, demonstrating that familiarity does not always translate into success: a statute that is often underestimated proved to be the highest-performing subject in this study.

These results point to something like the importance of valuing and staying updated in all areas, including those considered more basic. After all, the test reveals that what may seem 'easy' does not always correspond to the reality of the percentage of correct answers.

We must keep in mind that the number of questions differed for each subject. But if the difficulty of the questions were maintained, the candidates were in the same conditions as on test day, and the number of questions were identical for each subject, would the result be very different? A good question, isn't it? Let's not dwell on it here and leave it for another work; still, remember that a dataset can reveal much more and lets us explore larger and more complex questions.

Top 17 Students¶

At this moment, we are interested in the "top candidates". The notice provided for 17 vacancies, with 14 for general competition, 1 for people with disabilities, and 2 for Afro-descendants.

We will disregard the quotas since we do not have sufficient information, with the data used for our analysis, to list the candidates who could compete for these positions. Additionally, we will disregard age as a tiebreaker criterion, as outlined in the notice:

  • 11.1 – In case of a tie in the final score, the following criteria will be adopted for tiebreaking successively:
    • a) older age, among candidates aged 60 or older on the last day of registration for this Public Tender, according to article 27, sole paragraph of the Elderly Statute, Law No. 10,741, of October 1, 2003;
    • b) higher score in the objective questions of Portuguese Language;
    • c) higher score in the objective questions of Basic Knowledge of Legislation;
    • d) higher score in the objective questions of Mathematics;
    • e) older age, excluding candidates already covered by the criterion defined in item 'a'.

Now, based on the data in hand, we will investigate which candidates had the highest scores and could be among the 17 who would fill the available positions. We need to follow a few steps, starting with how many candidates scored 6.0 or higher. Why 6.0? Because, according to the notice, the minimum passing score is 6.0, so everyone at or above it is approved.

In [15]:
# Checking the number of scores greater than or equal to 6.0
major_or_equal_6 = df['nota_final'] >= 6.0
major_or_equal_6.sum()
Out[15]:
1735

There are 1735 candidates with a passing score, so let's check which distinct scores actually occur within this range:

In [16]:
# Checking the unique values within the range >= 6.0
unique_values = df.loc[major_or_equal_6, 'nota_final'].sort_values(ascending=False)
unique_values.unique()
Out[16]:
array([9.  , 8.75, 8.5 , 8.25, 8.  , 7.75, 7.5 , 7.25, 7.  , 6.75, 6.5 ,
       6.25, 6.  ])

Highlighting what we have there:¶

9, 8.75, 8.5, 8.25, 8, 7.75, 7.5, 7.25, 7, 6.75, 6.5, 6.25, 6

The values presented are the unique scores within the >= 6.0 interval. Now we need to know how many candidates obtained each one. We have scores like 9.0, 8.75, and so on, but we still don't know how many candidates there are at each score, so let's check that as well:

In [17]:
# Checking the quantity of each unique score
counts = unique_values.value_counts().sort_values(ascending=True)

# Transforming the score counts into a DataFrame
df_counts = pd.DataFrame(counts).reset_index()
df_counts.columns = ['nota_final', 'quantity']

# Displaying the DataFrame
df_counts
Out[17]:
nota_final quantity
0 9.00 2
1 8.75 2
2 8.50 8
3 8.25 22
4 8.00 47
5 7.50 64
6 7.75 71
7 7.25 140
8 7.00 178
9 6.75 194
10 6.50 265
11 6.25 357
12 6.00 385

I'll create a chart where the horizontal axis represents each grade and the vertical axis represents the number of candidates with that grade, just to make it more visual.

In [18]:
# Create bar chart
plt.figure(figsize=(10, 6))
plt.bar(df_counts['nota_final'], df_counts['quantity'], color='blue', width=0.2)

# Add labels and title
plt.xlabel('Final Grade')
plt.ylabel('Quantity')
plt.title('Frequency of Final Grades')

# Add values above the bars
for i, v in enumerate(df_counts['quantity']):
    plt.text(df_counts['nota_final'].iloc[i], v + 10, str(v), ha='center', va='bottom')

# Add grid
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.show()

As we can see in the table (and chart) above, we have 2 candidates who scored 9.0 and 2 who scored 8.75. Just out of curiosity, let's first see who these candidates are.

In [19]:
# Filtering the original DataFrame for scores 9.0 and 8.75
specific_note = df[df['nota_final'].isin([9.0, 8.75])]

# Displaying only the columns "candidato" and "nota_final"
result = specific_note[['candidato', 'port', 'mat', 'inf', 'leg', 'eca', 'nota_final']]

# Resetting the index
result = result.reset_index(drop=True)

# Displaying the result
result.sort_values(by='nota_final', ascending=False)
Out[19]:
candidato port mat inf leg eca nota_final
2 Jefferson Hyan Ferreira 4.50 1.25 1.00 1.75 0.5 9.00
3 Jonas Luís Rockenbach 4.25 1.25 1.25 1.75 0.5 9.00
0 Alan Christian Gimenez 4.25 1.25 1.00 1.75 0.5 8.75
1 Jackeline Santos Neves da Silva 4.25 0.75 1.25 2.00 0.5 8.75

With this "small set of candidates," it is easy to tell who ranks first according to the tiebreaker criteria. However, once we start evaluating the other scores in the table, doing it by hand stops being productive, let alone intuitive; so we need a more systematic approach that computes the final scores, applies the tiebreaker criteria, and then ranks the candidates in a list. Let's review the tiebreaker criteria we are dealing with:

  • b) higher score in the objective questions of Portuguese Language;
  • c) higher score in the objective questions of Basic Knowledge of Legislation;
  • d) higher score in the objective questions of Mathematics;
In [20]:
def calculate_placement(df, cutoff_score=6.0):
    # Create a copy of the original DataFrame
    higher_or_equal_scores = df[df['nota_final'] >= cutoff_score].copy()

    # Calculate final scores with tie-break criteria
    higher_or_equal_scores['nota_final_with_tiebreak'] = (
        higher_or_equal_scores['port'] +
        higher_or_equal_scores['mat'] +
        higher_or_equal_scores['inf'] +
        higher_or_equal_scores['leg'] +
        higher_or_equal_scores['eca']
    )

    # Sort the DataFrame by final scores with tie-break criteria in descending order
    higher_or_equal_scores = higher_or_equal_scores.sort_values(
        by=['nota_final_with_tiebreak', 'nota_final', 'port', 'leg', 'mat'],
        ascending=[False, False, False, False, False]
    )

    # Add a column with preliminary placement
    higher_or_equal_scores['preliminary placement'] = range(1, len(higher_or_equal_scores) + 1)

    # Set 'colocacao_preliminar' column as the index
    higher_or_equal_scores.set_index('preliminary placement', inplace=True)

    # Select the desired columns in the final DataFrame
    final_result = higher_or_equal_scores[['candidato', 'port', 'mat', 'inf', 'leg', 'eca', 'nota_final']]

    return final_result

# Example of using the function
df_final_result = calculate_placement(df)
df_final_result
Out[20]:
candidato port mat inf leg eca nota_final
preliminary placement
1 Jefferson Hyan Ferreira 4.50 1.25 1.00 1.75 0.5 9.00
2 Jonas Luís Rockenbach 4.25 1.25 1.25 1.75 0.5 9.00
3 Jackeline Santos Neves da Silva 4.25 0.75 1.25 2.00 0.5 8.75
4 Alan Christian Gimenez 4.25 1.25 1.00 1.75 0.5 8.75
5 Naylor Moreira Batista 4.25 0.75 1.00 2.00 0.5 8.50
... ... ... ... ... ... ... ...
1731 Matheus Luiz Pauka Siquieri 1.75 0.75 1.25 1.75 0.5 6.00
1732 Maycon Jose Marcelino 1.75 0.75 1.25 1.75 0.5 6.00
1733 Guilherme Cesar Ferreira Furtado 1.75 1.25 1.00 1.50 0.5 6.00
1734 Alexandre Carlos de Alencar Correa 1.50 0.75 1.25 2.00 0.5 6.00
1735 Gabriela de Almeida Barbosa 1.50 1.00 1.25 1.75 0.5 6.00

1735 rows × 7 columns

Something we haven't discussed before: notice that there is a jump between rows 5 and 1731 (if you observed carefully, you'll see the same "jump" in other tables presented earlier). pandas displays only the first and last rows of a large table, but believe me, all the candidates are in this set, okay? Now we have a table listing the candidates who scored 6.0 or higher, with their ranking in the first column, based on the highest final score and the tiebreaker criteria. Let's list the top 17 candidates.
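
That "jump" is just pandas truncating the display, which we can confirm with a minimal sketch (the demo DataFrame is hypothetical):

```python
import pandas as pd

# By default, pandas prints only the head and tail of a large DataFrame,
# with a "..." row in between. All rows are still in the DataFrame.
df_demo = pd.DataFrame({'x': range(1000)})
print('...' in repr(df_demo))   # the printed form is truncated
print(len(df_demo))             # every row is really there

# Lifting the display limit shows all rows, without changing the data.
with pd.option_context('display.max_rows', None):
    truncated = '...' in repr(df_demo)
print(truncated)
```

`pd.option_context` changes the display setting only inside the `with` block, so the rest of the notebook keeps the compact default output.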

In [21]:
df_top_17 = df_final_result.head(17)
df_top_17
Out[21]:
candidato port mat inf leg eca nota_final
preliminary placement
1 Jefferson Hyan Ferreira 4.50 1.25 1.00 1.75 0.50 9.00
2 Jonas Luís Rockenbach 4.25 1.25 1.25 1.75 0.50 9.00
3 Jackeline Santos Neves da Silva 4.25 0.75 1.25 2.00 0.50 8.75
4 Alan Christian Gimenez 4.25 1.25 1.00 1.75 0.50 8.75
5 Naylor Moreira Batista 4.25 0.75 1.00 2.00 0.50 8.50
6 Elisangela Pacheco da Silva 4.25 0.50 1.25 2.00 0.50 8.50
7 Yukiko Sakomoto Belem 4.25 0.50 1.25 2.00 0.50 8.50
8 Felipe Pfeffer 4.25 1.00 1.00 1.75 0.50 8.50
9 Thais Nicoletti Silva 4.25 1.00 1.25 1.50 0.50 8.50
10 Allan Sorrilha Meira Barros 4.00 1.25 1.00 1.75 0.50 8.50
11 Vanessa Carvalho Fenelon 4.00 1.25 1.00 1.75 0.50 8.50
12 Bruna Barbara Polizer 4.00 1.00 1.25 1.75 0.50 8.50
13 Mateus Henrique Aparecido Primilla 4.25 0.50 1.25 2.00 0.25 8.25
14 Pedro Henrique de Souza Marques 4.25 1.00 1.00 1.50 0.50 8.25
15 Renan Constantino Colli 4.25 1.00 1.00 1.50 0.50 8.25
16 Robson dos Santos Mendonça 4.25 1.00 1.00 1.50 0.50 8.25
17 Gabriel Weber Maximowski 4.25 0.75 1.25 1.50 0.50 8.25

Above, then, are the names of those who will work for the state, at least according to this preliminary score analysis :).

I'll create another chart just for us to visualize at what level a candidate ranked compared to another in each subject. Each bar represents a subject, and the vertical axis represents the score for each subject (remember that the maximum score for each subject is different, as we saw in a previous chart).

The chart is interactive. Each color represents a specific subject, as indicated in the legend beside the chart ("Disciplina"). With the bars stacked, the chart initially doesn't make it easy to compare performance in each subject, but there's an "effect": hover over a bar and a "floating" box appears with the candidate's name, the subject's name, and the score achieved. In the "Disciplina" legend, clicking any of the colors hides that subject from the chart; clicking the same color again brings it back.

Try leaving only one color visible; it's much easier to compare candidates' performance that way. Give it a try and see for yourself.

In [22]:
def create_bar_chart(df, title, user_index=None, user_name=None):
    # Select the first 17 rows of the resulting DataFrame
    df_top_17 = df.head(17).copy()

    # Add a row to represent the user's performance
    if user_index is not None:
        df_user = df[df.index == user_index].copy()
        df_user['candidato'] = user_name
        df_combined = pd.concat([df_top_17, df_user])
    else:
        df_combined = df_top_17

    # Create an interactive plot with Plotly
    fig = px.bar(
        df_combined,
        x='candidato',
        y=['port', 'mat', 'inf', 'leg', 'eca'],
        title=title,
        labels={'value': 'Nota', 'variable': 'Disciplina'},
        hover_name='candidato',
        template='plotly',
    )

    # Set axis titles and hide the y-axis
    fig.update_layout(
        xaxis_title='Aluno',
        yaxis_title='Nota',
        yaxis=dict(visible=False)  # Hide the y-axis
    )

    # Adjust the margins around the plot area
    fig.update_layout(
        margin=dict(l=50, r=20, t=40, b=20)
    )

    # Highlight the user's bars in red. px.bar creates one trace per
    # subject (named 'port', 'mat', ...), so recolor the segment whose
    # x value matches the user's name within each trace.
    if user_index is not None:
        for trace in fig.data:
            base_color = trace.marker.color
            trace.marker.color = [
                'rgba(255, 0, 0, 0.7)' if x == user_name else base_color
                for x in trace.x
            ]

    # Display the interactive plot
    fig.show()

# Show the first plot
create_bar_chart(df_top_17, 'Performance comparison by subject of the Top 17 candidates')

Let's see in what position I ended up on the list after the tiebreaker criteria were applied:

In [23]:
df_final_result[df_final_result['candidato'] == 'Jeanco Mateus de Oliveira Volfe']
Out[23]:
candidato port mat inf leg eca nota_final
preliminary placement
63 Jeanco Mateus de Oliveira Volfe 3.5 0.75 1.25 2.0 0.5 8.0

My preliminary placement is 63.

Now I want to put my performance alongside the top performers. It won't be the most joyful thing to do, but... hehe. Here it goes: THE LAST COLUMN OF THE GRAPH represents my performance.

In [24]:
create_bar_chart(df_final_result, 'Scores by subject for the Top 17 candidates and my performance compared to all of them', user_index=63, user_name='Jeanco Mateus de Oliveira Volfe')

I can't say I hit the mark, as there were few spots compared to my preliminary position (maybe I'll do another project with the final result), but I'm definitely improving with each exam I take and with each day of study. This is reflected in the score I achieved and in my growing confidence in what I know. If I know it, I know I know it; if I don't, then I have to guess (if it's CESPE, depending on the situation, guessing isn't an option; those who understand will get it! lol). If I make a mistake thinking I got it right, I'll study to make fewer mistakes in the future.

Congratulations to everyone who really studied and competed in this exam, and congratulations to those who achieved a good preliminary score. If you're just starting in the world of competitions and got a very low score, I can say, don't get discouraged. I've been there, and I know how it is (and I understand that everyone has a different and often challenging reality). But with every summary, every page read, every "pen strike" and exam, you're getting closer to being one of the top performers in my future performance charts...

Be (keep being) excellent!

Final Thoughts¶

As I wrote this piece, I pondered how privileged I am to exist in a time with so many optimizing technologies. I was able to perform these simple analyses in about 5 hours of work, spread over two days. I must acknowledge the power of tools like ChatGPT, as well as the contribution of the data analysis community, which provides free videos, improves documentation, teaches, and more. A more experienced analyst might tell me this work is awful, and I wouldn't yet know how to judge that. I've been seeking, reading about the subject, asking questions, taking courses, and studying every day to improve. So, I'm grateful to those who can contribute positively.

The work fulfills its purpose in demonstrating some aspects of exam scores, subjects, and candidates. It certainly opened my mind to much more than I could have accomplished.

The analysis not only provided me with a deeper understanding of the exam results but also reinforced my commitment to continuous learning. Furthermore, I believe everyone should have some power to analyze data to have more accurate information. It's fascinating when we can discover something simple that we thought was one way but turns out to be diametrically opposite when we take an honest and "educated" approach using data.

I thank the online community, valuable courses, and ChatGPT (OpenAI) for their contribution to this journey. As I progress, I look forward to applying these skills to more complex challenges and continuing to explore the vast world of data analysis. May this small journey inspire others to discover the power of analysis and to transform data into meaningful information that encourages reflection and the change we desire in the world.